Captured from the BIX network:
==========================
microbytes/features #243, from microbytes, 11743 chars, Mon Jan 22 16:33:53 1990
--------------------------
TITLE: FIRST IMPRESSION: Motorola's New 68040 Microprocessor
by Tom Thompson
---------------------------
This new CISC microprocessor
offers RISC performance
---------------------------
Motorola has officially unwrapped its newest 32-bit
microprocessor, the 68040. Manufactured with 0.8-micron
high-speed CMOS technology, the 68040 packs 1.2 million
transistors on a single silicon die. With 900,000 more
transistors to work with than the roughly 300,000 in a
68030, the 68040's designers added new features
and boosted performance. New features include the following:
-- Optimized 68030 integer unit. While retaining object-code
compatibility with previous 68000-family processors, the IU
has been optimized to execute instructions in fewer clock
cycles (i.e., run faster). The claimed boost in performance is
three times that of a 68030.
-- Integral FPU. The 68020 and 68030 require external FPU
coprocessor chips to handle floating-point math. The 68040,
however, has an FPU built into it, giving it the power to do
serious number crunching. The FPU's data types are
compatible with the ANSI/IEEE 754 standard for binary
floating-point math, and its instruction set is object
code-compatible with Motorola's 68881/68882 FPUs. Like
the IU, the 68040's on-chip FPU has been optimized to
execute frequently used instructions using fewer clock
cycles. The claimed performance boost is 10 times that of a
68882.
-- Large caches. Processor accesses to the system bus are
minimized by storing the most recently used set of
instructions or data in on-chip, 4K-byte caches. Both caches
operate independently but can be accessed at the same time.
Bus snoop logic is used to maintain cache coherency (i.e., it
ensures that the cache's contents match those parts of
memory corresponding to the cache). The bus snooper's design
is fine-tuned to support multiprocessor systems where one
or more bus masters or 68040s might share the same section
of memory.
-- Separate memory units for instructions and data. Each
memory unit consists of a memory management unit, a cache
controller, and bus snoop logic. The MMUs use a subset of the
68030's MMU instruction set. Both memory units function
independently of each other to improve processor throughput.
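To picture what the snoop logic accomplishes, here is a toy write-invalidate sketch in Python. The class names and the write-through policy are illustrative assumptions for clarity, not the 68040's actual bus protocol:

```python
class SnoopingCache:
    """Toy cache on a shared bus: a write by any master invalidates
    the matching line in every other cache (write-invalidate)."""
    def __init__(self):
        self.lines = {}                    # address -> cached value

    def read(self, bus, addr):
        if addr not in self.lines:         # miss: fetch from memory
            self.lines[addr] = bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, bus, addr, value):
        self.lines[addr] = value
        bus.memory[addr] = value           # write-through, for simplicity
        bus.broadcast(self, addr)          # other caches snoop the write

    def snoop(self, addr):
        self.lines.pop(addr, None)         # drop the now-stale copy

class Bus:
    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast(self, writer, addr):
        for cache in self.caches:
            if cache is not writer:
                cache.snoop(addr)

bus = Bus()
a, b = SnoopingCache(), SnoopingCache()
bus.caches = [a, b]
bus.memory[0x100] = 1
a.read(bus, 0x100); b.read(bus, 0x100)   # both caches now hold the line
a.write(bus, 0x100, 2)                   # b's stale copy is invalidated
```

After the write, cache b must re-read the line from memory and so sees the new value; that is the coherency guarantee the snooper provides.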
The 68040 ships with an initial clock speed of 25 MHz;
higher speeds are to be available in the future, Motorola says.
The 68040 comes in a 179-pin grid-array package. With the
elimination of coprocessor function lines (now that the MMU
and FPU are consolidated onto the processor) and the addition
of snoop control lines, the 68040 is not pin-compatible with
the 68030.
Because of the 68040's software compatibility with its
predecessors, it can tap into the existing software base of
680x0 applications. It does so while eliminating a component
(the external FPU) from a computer's design and improving
performance at the same time. In fact, the 68040 executes
instructions at an average rate of nearly one per clock cycle --
the same as a RISC processor.
Fine-Tuned for Performance
The 68040 was built on the firm foundation of its
predecessors. The design team used the experience garnered
from developing earlier processors to aid in optimizing the
throughput of the 040.
The 040 was designed from the ground up, Motorola engineers
said. It incorporates a high degree of parallelism using a
number of internal buses. An internal Harvard architecture
gives the processor full access to both instructions and data.
Both the IU and FPU have separate pipelines and can operate
concurrently. For example, the FPU can perform
floating-point instructions independently of the IU. Each
stream (instructions or data) has its own dedicated cache and
memory unit that function independently of each other. A smart bus
controller assigns priorities to bus traffic to and from the
caches.
There were several key areas where Motorola was able to
boost performance. The first was in reducing the clock cycles
needed to execute certain instructions. The next was to
ensure that the processor funnels instructions and data into
itself quickly and constantly, lest it stall while waiting on
information. The processor then gets its results back into the
system without interfering with incoming information.
Finally, the processor stays off the system bus to a greater
extent than other processor designs do, leaving the bus free
for DMA transfers and other bus masters.
CISC with the Speed of RISC
The IU was optimized so that high-usage instructions execute
in fewer clock cycles, particularly branch instructions.
Motorola said it performed thousands of code traces using
real-world applications to determine which instructions
were used most often. The IU consists of six stages: instruction
prefetch, decode, effective address calculation, operand
fetch, execution, and writeback (i.e., the result is written to
either a register or to memory). Each stage works
concurrently on the instruction pipeline. Dual prefetch and
decode units deal with the branch instructions: One set
processes the instruction taken on the branch, and another
processes the instruction not taken. In this way, no matter
what the outcome, the IU has the next instruction decoded and
ready to go without seriously disrupting the pipeline. This
complex design has a big payoff: Motorola has determined
that the average instruction takes 1.3 clock cycles to
execute. The ability to execute an instruction once per clock
cycle is the performance edge of RISC processors -- yet the
68040's IU accomplishes the same goal while executing
complex-instruction-set computer (CISC) instructions.
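The arithmetic behind that edge is straightforward. A quick sketch using only the article's own figures (25 MHz and 1.3 cycles per instruction):

```python
def instructions_per_second(clock_hz: float, cpi: float) -> float:
    """Sustained instruction rate for a given clock and cycles-per-instruction."""
    return clock_hz / cpi

# The 68040's quoted 1.3 CPI at 25 MHz, versus an ideal 1.0-CPI RISC part
mips_040 = instructions_per_second(25e6, 1.3) / 1e6    # about 19.2 MIPS
mips_risc = instructions_per_second(25e6, 1.0) / 1e6   # 25.0 MIPS
```

At 1.3 cycles per instruction the 68040 sustains roughly 19 million instructions per second, within striking distance of a same-clock RISC design.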
The FPU adds 11 registers to the 68040 register set: Eight of
them are 80-bit floating-point registers, and three are
status, control, and instruction address registers. The FPU
has a three-stage execution unit, and, like the IU, each stage
operates concurrently. Load and store instructions (FMOVE)
can be performed during other arithmetic operations, and a
64- by 8-bit hardware multiplication unit speeds many
calculations. However, the FPU only implements a subset of
the 68882 instructions on-chip. The transcendental
(trigonometric and exponential) functions are emulated in
software via a software trap. But Motorola claims that even
these instructions should execute 25% to 100% faster on a
25-MHz 68040 than on a 33-MHz 68882 FPU.
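Such a trap handler ultimately boils down to evaluating a series in software. As an illustration only -- Motorola's actual emulation routines are not described in the article -- here is a Taylor-series sine of the general flavor a handler might use:

```python
import math

def sin_series(x: float, terms: int = 10) -> float:
    """Approximate sin(x) with its Taylor series -- the kind of routine a
    software trap might run when FSIN is not implemented in hardware."""
    result, term = 0.0, x
    for n in range(terms):
        result += term
        # next term: multiply by -x^2 / ((2n+2)(2n+3))
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return result
```

For arguments reduced into a small range, a handful of terms already matches the hardware's precision, which is why trapping to software costs less than it might first appear.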
Boosting Throughput
In the area of throughput, each stream is managed by a
separate memory unit that uses an MMU for
logical-to-physical address translations during bus accesses.
These MMUs support demand-paged virtual memory. Both
MMUs have a four-way set-associative address translation
cache (ATC) with 64 entries (versus 22 entries for the 68030).
The ATCs reduce processor overhead by storing the most
recent address translations. When an address translation is
required, the ATC is searched, and if it contains the address,
it is used immediately. Otherwise, a combination of
high-speed hardware logic and microcode searches the
translation tables located in main memory.
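The lookup sequence just described -- check the ATC first, walk the tables in memory only on a miss -- can be sketched as a toy model, with a Python dictionary standing in for the real translation tables:

```python
PAGE_SIZE = 4096   # one of the 68040's two supported page sizes

class ATC:
    """Toy address-translation cache: a hit returns the cached mapping;
    a miss falls back to a (simulated) table walk in main memory."""
    def __init__(self, page_table):
        self.page_table = page_table   # logical page -> physical page
        self.entries = {}              # the cache of recent translations
        self.walks = 0                 # how many slow table walks occurred

    def translate(self, logical_addr):
        page, offset = divmod(logical_addr, PAGE_SIZE)
        if page not in self.entries:                    # ATC miss
            self.walks += 1
            self.entries[page] = self.page_table[page]  # table walk
        return self.entries[page] * PAGE_SIZE + offset

atc = ATC({0: 7, 1: 3})
atc.translate(0x123)   # miss: one table walk
atc.translate(0x456)   # hit: same page, no walk needed
```

Two accesses to the same page cost only one walk; the second translation comes straight out of the cache.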
Like the FPU, these MMUs implement a subset of the 68030's
MMU instruction set. Gone are the PLOAD and PMOVE
instructions, because enhanced existing instructions made
them superfluous. Also, only two memory page sizes are
supported, 4K bytes and 8K bytes, whereas the 68030 MMU
supported eight page sizes ranging from 256 bytes to 32K
bytes. A design trade-off was made here: a performance gain
was possible by supporting only the two most common page
sizes. In
any case, this change impacts only operating-system code,
since MMU instructions aren't normally used by applications.
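The page size determines how a logical address splits into page number and byte offset: 12 offset bits for 4K pages, 13 for 8K. A short sketch (the sample address is arbitrary):

```python
def split(addr: int, page_size: int):
    """Split a logical address into (page number, byte offset)."""
    offset_bits = page_size.bit_length() - 1   # 12 for 4K, 13 for 8K
    return addr >> offset_bits, addr & (page_size - 1)

# The same address maps differently under the two supported sizes.
assert split(0x00401A2C, 4096) == (0x401, 0xA2C)    # 4K pages
assert split(0x00401A2C, 8192) == (0x200, 0x1A2C)   # 8K pages
```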
The two on-chip 4K caches improve processor throughput in two
ways: They keep the pipelines filled and minimize system bus
accesses. To see how this is done, you must examine the
structure of the cache. Each is a four-way set-associative
cache composed of 64 sets of four lines. A line consists of
four longwords, or 16 bytes. Cache lines are read or written
rapidly using burst
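That geometry fixes how an address is carved up: four bits select the byte within a 16-byte line, six bits select one of the 64 sets, and the remaining bits form the tag compared against all four ways. A sketch of the split (the field layout is inferred from the sizes above, not taken from Motorola documentation):

```python
LINE_BYTES = 16   # 4 longwords per line
SETS = 64         # 4K bytes / (4 ways * 16 bytes per line)

def cache_fields(addr: int):
    """Split an address into (tag, set index, byte-in-line) for a
    4K-byte, four-way set-associative cache with 16-byte lines."""
    byte = addr & (LINE_BYTES - 1)       # low 4 bits
    index = (addr >> 4) & (SETS - 1)     # next 6 bits pick the set
    tag = addr >> 10                     # remainder is the tag
    return tag, index, byte

# Addresses 1K bytes apart land in the same set and compete for its four ways.
assert cache_fields(0x0000)[1] == cache_fields(0x0400)[1]
```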